Exploratory Data Analysis (EDA)

1 Introduction

Exploratory Data Analysis (EDA) is one of the fundamental steps in any data science process. It allows us to understand the structure, detect anomalies, and uncover patterns in the data before modeling.

“Without EDA, you’re not doing data science, you’re just guessing.”

EDA combines statistics, programming, and visualization to explore datasets. This report is designed to help you practice these core skills using real-world data.

1.1 Dataset

We will use the movies dataset from vega-datasets, which includes information on more than 3,000 films, such as their ratings, genres, running time, and box office revenue.

Let’s load and preview the dataset:

Code
import pandas as pd
import altair as alt
from vega_datasets import data

# Load dataset
movies = data.movies()

# Show first rows
movies.head()
Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
0 The Land Girls 146083.0 146083.0 NaN 8000000.0 Jun 12 1998 R NaN Gramercy None None None None NaN 6.1 1071.0
1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0
2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0
3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 None NaN Fine Line None Comedy None None 13.0 NaN NaN
4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction None 62.0 3.4 165.0

Now, let’s examine the shape (number of rows and columns) of the dataset:

Code
movies.shape
(3201, 16)

The dataset has 3,201 entries (rows) and 16 features (columns).

1.2 First Steps

Before diving deeper into the data, it’s useful to explore some key metadata:

  • ✅ The column names and their data types
  • ⚠️ The presence of missing values
  • 📊 Summary statistics for numeric columns

1.2.1 Column Names and Data Types

Understanding the structure of the dataset helps us know what type of data we’re dealing with.

Code
movies.dtypes
Title                      object
US_Gross                  float64
Worldwide_Gross           float64
US_DVD_Sales              float64
Production_Budget         float64
Release_Date               object
MPAA_Rating                object
Running_Time_min          float64
Distributor                object
Source                     object
Major_Genre                object
Creative_Type              object
Director                   object
Rotten_Tomatoes_Rating    float64
IMDB_Rating               float64
IMDB_Votes                float64
dtype: object

We can also use .info() for a more complete summary, including non-null counts:

Code
# Overview of the dataset
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3201 entries, 0 to 3200
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Title                   3200 non-null   object 
 1   US_Gross                3194 non-null   float64
 2   Worldwide_Gross         3194 non-null   float64
 3   US_DVD_Sales            564 non-null    float64
 4   Production_Budget       3200 non-null   float64
 5   Release_Date            3201 non-null   object 
 6   MPAA_Rating             2596 non-null   object 
 7   Running_Time_min        1209 non-null   float64
 8   Distributor             2969 non-null   object 
 9   Source                  2836 non-null   object 
 10  Major_Genre             2926 non-null   object 
 11  Creative_Type           2755 non-null   object 
 12  Director                1870 non-null   object 
 13  Rotten_Tomatoes_Rating  2321 non-null   float64
 14  IMDB_Rating             2988 non-null   float64
 15  IMDB_Votes              2988 non-null   float64
dtypes: float64(8), object(8)
memory usage: 400.3+ KB
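
The third item in the checklist above, summary statistics for numeric columns, is usually covered with describe(). The sketch below runs it on a small made-up frame (toy, with invented values) rather than on movies, so the numbers are illustrative only; on the real data the call would be movies.describe().

```python
import pandas as pd

# Toy stand-in for `movies`; the values are invented for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "US_Gross": [100.0, 200.0, 300.0, None],
    "IMDB_Rating": [6.0, 7.0, None, 8.0],
})

# describe() keeps the numeric columns by default and skips missing values,
# so "count" is the number of non-null entries per column.
summary = toy.describe()
print(summary)
```

Comparing the mean with the 50% row (the median) gives a quick first hint of skew before any chart is drawn.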

1.3 Missing Values

Detecting and handling missing values is a critical step in any EDA process. Missing data can bias analysis or break downstream models if not handled properly. In this section we will:

  • Detect patterns in missingness
  • Identify if some columns are almost entirely null
  • Decide whether to drop or impute certain variables

1.3.1 Percentage of Missing Values per Column

Let’s start by computing the percentage of missing values in each column:

Code
nan_percent = movies.isna().mean() * 100
nan_percent_sorted = nan_percent.sort_values(ascending=False).round(2)
nan_percent_sorted
US_DVD_Sales              82.38
Running_Time_min          62.23
Director                  41.58
Rotten_Tomatoes_Rating    27.49
MPAA_Rating               18.90
Creative_Type             13.93
Source                    11.40
Major_Genre                8.59
Distributor                7.25
IMDB_Rating                6.65
IMDB_Votes                 6.65
US_Gross                   0.22
Worldwide_Gross            0.22
Title                      0.03
Production_Budget          0.03
Release_Date               0.00
dtype: float64

1.3.2 Reshaping the Data for Visualization

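To visualize missing values with Altair, we need to reshape the data into a long format with one row per cell of the original table: the row index, the column name, and a boolean flag that is True when the value is missing: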

Code
movies_nans = movies.isna().reset_index().melt(
    id_vars='index',
    var_name='column',
    value_name="NaN"
)
movies_nans
index column NaN
0 0 Title False
1 1 Title False
2 2 Title False
3 3 Title False
4 4 Title False
... ... ... ...
51211 3196 IMDB_Votes False
51212 3197 IMDB_Votes True
51213 3198 IMDB_Votes False
51214 3199 IMDB_Votes False
51215 3200 IMDB_Votes False

51216 rows × 3 columns

1.3.3 Heatmap of Missing Data

This heatmap shows where missing values occur across rows and columns. Patterns may indicate:

  • Columns with consistently missing values
  • Entire rows with large gaps
  • Correlated missingness between variables

By default, Altair refuses to render charts built from more than 5,000 rows and raises a MaxRowsError. Our long-format table has 51,216 rows, so we disable that check:

Code
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')

Now we can create the heatmap:

Code
alt.Chart(movies_nans).mark_rect().encode(
    alt.X('index:O'),
    alt.Y('column'),
    alt.Color('NaN')
).properties(
    width=1000
)

This plot can help identify columns or rows with critical data issues.
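
Correlated missingness can also be checked numerically. A minimal sketch, assuming a small invented frame (toy) in place of movies: convert the missing-value flags to 0/1 and correlate them.

```python
import pandas as pd

# Toy stand-in for `movies`; the values are invented for illustration.
toy = pd.DataFrame({
    "IMDB_Rating": [6.1, None, 6.8, None, 3.4, 7.0],
    "IMDB_Votes": [1071.0, None, 865.0, None, 165.0, 55687.0],
    "Running_Time_min": [None, 101.0, None, 157.0, 129.0, None],
})

# 1 where a cell is missing, 0 otherwise.
missing_flags = toy.isna().astype(int)

# Correlation between the flags: +1 means two columns are always missing
# together, -1 means one is missing exactly when the other is present.
missing_corr = missing_flags.corr()
print(missing_corr.round(2))
```

In the real dataset, IMDB_Rating and IMDB_Votes have the same missing percentage (6.65%), which makes them natural candidates for this check. A column with no missing values has constant flags, so its correlations come out as NaN.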

1.3.4 Dropping Columns with High Missing Rate

In many real-world cases, we may decide to remove columns that have too many missing values. Let’s set a threshold of 70%:

Code
threshold_nan = 70 # in percent
cols_to_drop = nan_percent[nan_percent>threshold_nan].index
cols_to_drop
Index(['US_DVD_Sales'], dtype='object')

Only US_DVD_Sales (82.38% missing) exceeds the 70% threshold. Running_Time_min, at 62.23%, stays in the dataset, so analyses of runtime will rely on fewer than half of the movies.
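
For columns that stay in the dataset, imputation is the alternative to dropping. This report does not impute anything; the sketch below only illustrates one simple option (median imputation) on invented values, and the Running_Time_imputed column name is hypothetical.

```python
import pandas as pd

# Toy stand-in for one numeric column of `movies`; values are invented.
toy = pd.DataFrame({"Running_Time_min": [101.0, None, 157.0, None, 129.0]})

# Median of the observed values (NaN is skipped by default).
median_runtime = toy["Running_Time_min"].median()

# Keep a flag so later analysis can tell imputed rows from observed ones.
toy["Running_Time_imputed"] = toy["Running_Time_min"].isna()
toy["Running_Time_min"] = toy["Running_Time_min"].fillna(median_runtime)
print(toy)
```

Median imputation narrows the distribution around its center, so histograms and density plots of an imputed column should be read with the flag in mind.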

1.4 Cleaned Dataset

Finally, we drop the selected columns and inspect the updated dataset:

Code
movies_cleaned = movies.drop(columns=cols_to_drop)
movies_cleaned
Title US_Gross Worldwide_Gross Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
0 The Land Girls 146083.0 146083.0 8000000.0 Jun 12 1998 R NaN Gramercy None None None None NaN 6.1 1071.0
1 First Love, Last Rites 10876.0 10876.0 300000.0 Aug 07 1998 R NaN Strand None Drama None None NaN 6.9 207.0
2 I Married a Strange Person 203134.0 203134.0 250000.0 Aug 28 1998 None NaN Lionsgate None Comedy None None NaN 6.8 865.0
3 Let's Talk About Sex 373615.0 373615.0 300000.0 Sep 11 1998 None NaN Fine Line None Comedy None None 13.0 NaN NaN
4 Slam 1009819.0 1087521.0 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction None 62.0 3.4 165.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3196 Zack and Miri Make a Porno 31452765.0 36851125.0 24000000.0 Oct 31 2008 R 101.0 Weinstein Co. Original Screenplay Comedy Contemporary Fiction Kevin Smith 65.0 7.0 55687.0
3197 Zodiac 33080084.0 83080084.0 85000000.0 Mar 02 2007 R 157.0 Paramount Pictures Based on Book/Short Story Thriller/Suspense Dramatization David Fincher 89.0 NaN NaN
3198 Zoom 11989328.0 12506188.0 35000000.0 Aug 11 2006 PG NaN Sony Pictures Based on Comic/Graphic Novel Adventure Super Hero Peter Hewitt 3.0 3.4 7424.0
3199 The Legend of Zorro 45575336.0 141475336.0 80000000.0 Oct 28 2005 PG 129.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 26.0 5.7 21161.0
3200 The Mask of Zorro 93828745.0 233700000.0 65000000.0 Jul 17 1998 PG-13 136.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 82.0 6.7 4789.0

3201 rows × 15 columns

2 Types of Data Analysis in EDA

Understanding the nature of variables and the relationships between them is central to Exploratory Data Analysis (EDA). Depending on the number and type of variables involved, we can classify analysis into three main categories: univariate, bivariate, and multivariate.

Table 1 shows how EDA analysis types vary depending on the number and types of variables involved.

This classification helps guide the selection of appropriate visualization techniques and statistical methods for each case.

Table 1: Types of analysis and variable combinations used in EDA

Analysis Type  Variable Types                 Description                                  Examples
-------------  -----------------------------  -------------------------------------------  -----------------------------
Univariate     Categorical                    One qualitative variable                     Gender, Product Category
Univariate     Quantitative                   One numerical variable                       Income, Age, Runtime
Bivariate      Categorical – Categorical      Two qualitative variables                    Gender vs Nationality
Bivariate      Categorical – Quantitative     One qualitative and one numerical            Province vs Population
Bivariate      Quantitative – Quantitative    Two numerical variables                      Age vs Income
Multivariate   3 or more variables (any mix)  Combination of categorical and/or numerical  Age vs Income by Gender, etc.

2.1 Univariate Analysis: Quantitative

A univariate analysis focuses on examining a single numeric variable to understand its distribution, shape, central tendency, and spread. One of the most common tools for this is the histogram.

In this case, we’ll explore the distribution of the movie runtime (Running_Time_min).

2.1.1 Basic Histogram

We start by creating a histogram to visualize the distribution of running times:

Code
alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('Running_Time_min',bin=alt.Bin(maxbins=30)),
    alt.Y('count()')
).properties(
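    title='Histogram of Movie Runtimes (up to 30 bins)'
)

This chart shows how many movies fall into each time interval (bin). Note that maxbins is an upper limit rather than an exact count: Altair chooses round bin boundaries and may use fewer bins. Histograms can also look quite different depending on the number and size of bins used.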

2.1.2 Effect of Bin Size

Let’s compare how the histogram shape changes with different bin sizes:

Code
histogram_1 = alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('Running_Time_min',bin=alt.Bin(maxbins=8)),
    alt.Y('count()')
)

histogram_2 = alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('Running_Time_min',bin=alt.Bin(maxbins=10)),
    alt.Y('count()')
)

histogram_1 | histogram_2

Even though both plots use the same data, the choice of bin size changes the visual interpretation. A small number of bins may hide details, while too many bins can make it harder to spot trends.
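
The same effect can be seen in the raw counts. The sketch below uses NumPy on invented runtimes with two clusters; unlike Altair's maxbins, the bins argument of np.histogram is an exact bin count.

```python
import numpy as np

# Invented runtimes (minutes) with two clusters, around 90 and around 150.
runtimes = np.array([85, 88, 90, 91, 93, 95, 148, 150, 152, 155], dtype=float)

# Same data and range, two different bin counts.
coarse_counts, coarse_edges = np.histogram(runtimes, bins=2, range=(80, 160))
fine_counts, fine_edges = np.histogram(runtimes, bins=8, range=(80, 160))

print(coarse_counts)  # two wide bins
print(fine_counts)    # eight narrow bins
```

With two bins the data looks like two adjacent blocks; with eight, the empty bins between the clusters become visible.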

2.1.3 Density Plots (Kernel Density Estimates, KDE)

Density plots offer a smoothed alternative to histograms. Instead of using rectangular bins to count data points, they estimate the probability density function by placing bell-shaped curves (kernels) at each observation and summing them.

This approach helps reduce the visual noise and jaggedness that can occur in histograms and gives a clearer picture of the underlying distribution.

Code
alt.Chart(movies_cleaned).transform_density(
    'Running_Time_min',
    as_=['Running_Time_min','density'],
).mark_area().encode(
    alt.X('Running_Time_min'),
    alt.Y('density:Q')
).properties(
    title="Movies runtime"
)
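
To make the idea of placing a kernel at each observation concrete, here is a minimal NumPy sketch of a Gaussian KDE on invented runtimes. The function name kde_by_hand and the bandwidth of 10 minutes are assumptions chosen for illustration; transform_density selects its own bandwidth unless one is passed.

```python
import numpy as np

def kde_by_hand(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    # Standardized distance from every grid point to every sample:
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Summing the bumps and dividing by n keeps the total area equal to 1.
    return kernels.mean(axis=1)

# Invented runtimes in minutes; the bandwidth is picked by hand.
runtimes = np.array([88.0, 92.0, 95.0, 101.0, 129.0, 136.0, 157.0])
grid = np.linspace(0.0, 250.0, 1001)
density = kde_by_hand(runtimes, grid, bandwidth=10.0)

# Riemann-sum approximation of the area under the curve.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

The printed area should be very close to 1, which is what makes the curve a density. A larger bandwidth widens each bump and smooths the curve, playing the same role as bin width in a histogram.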

2.1.4 Grouped Density Plot

We can also compare distributions across groups by splitting the KDE by a categorical variable using the groupby parameter. This helps us see how the distribution differs between categories, such as genres.

Code
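# Clicking a legend entry selects that genre (Altair 5+; on Altair 4 use
# alt.selection_single and .add_selection instead)
selection = alt.selection_point(fields=['Major_Genre'], bind='legend')

alt.Chart(movies_cleaned).transform_density(
    'Running_Time_min',
    groupby=['Major_Genre'],
    as_=['Running_Time_min', 'density'],
).mark_area().encode(
    alt.X('Running_Time_min'),
    alt.Y('density:Q', stack=None),
    alt.Color('Major_Genre'),
    # Selected genres are semi-transparent; the rest fade into the background
    opacity=alt.condition(selection, 
        alt.value(0.5), 
        alt.value(0.05)
    )
).add_params(
    selection
).properties(
    title="Movies Runtime by Genre (Interactive Filter)"
).interactive()

The opacity encoding draws the selected genres at 0.5, so overlapping distributions remain visible and small density areas are not completely hidden behind larger ones; genres that are not selected fade to 0.05. With no selection, every genre is drawn at 0.5. Click a legend entry to isolate one genre.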

From this plot, we can observe, for example, that Drama movies have runtimes nearly as long as the longest Adventure movies, even though their overall distributions differ.

2.2 Bivariate Analysis: Categorical vs Quantitative

Bivariate analysis examines the relationship between two variables. In this case, we focus on one categorical variable (e.g., genre) and one quantitative variable (e.g., revenue), which is a very common scenario in exploratory data analysis.

This type of analysis is useful to:

  • Compare average or median values across categories.
  • Detect outliers or high-variance groups.
  • Understand distributional differences across categories.

Below are several effective visualizations for this analysis.

2.2.1 Basic Bar Chart

Bar charts are effective for comparing aggregated values (like the mean) across different groups. However, they hide the distribution and variation within each group.

Code
alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('mean(Worldwide_Gross)'),
    alt.Y("Major_Genre")
).properties(
    title="Average Worldwide Gross by Genre"
)

This bar chart shows the mean Worldwide Gross per genre. It is useful for identifying which genres earn more on average (gross revenue, not profit, since budgets are not subtracted), but it does not show how spread out the data is.

2.2.2 Tick Plot

To visualize individual data points, we use a tick plot. This helps uncover variability within genres and detect outliers.

Code
alt.Chart(movies_cleaned).mark_tick().encode(
    alt.X('Worldwide_Gross'),
    alt.Y("Major_Genre"),
    alt.Tooltip('Title:N')
).properties(
    title="Individual Gross per Movie by Genre"
)

2.2.3 Heatmaps

Heatmaps can summarize the frequency of data points across both axes (quantitative and categorical) using color intensity. They are particularly useful for spotting patterns without getting overwhelmed by individual points.

Code
alt.Chart(movies_cleaned).mark_rect().encode(
    alt.X('Worldwide_Gross',bin=alt.Bin(maxbins=100)),
    alt.Y("Major_Genre"),
    alt.Color('count()'),
    alt.Tooltip('count()')
).properties(
    title="Heatmap of Movie Counts by Gross and Genre"
)

This heatmap shows how frequently movies from each genre fall into different revenue ranges.

2.2.4 Boxplot

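Boxplots are useful for comparing distributions across categories and identifying outliers. With Altair's default settings, a boxplot summarizes a distribution using five statistics:

  • Median (Q2)
  • First Quartile (Q1)
  • Third Quartile (Q3)
  • Lower Whisker: the smallest observation at or above Q1 - 1.5 × IQR
  • Upper Whisker: the largest observation at or below Q3 + 1.5 × IQR

Points beyond the whiskers are drawn individually as outliers.

Code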
alt.Chart(movies_cleaned).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y("Major_Genre")
).properties(
    title="Boxplot of Worldwide Gross by Genre"
)
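
The statistics behind a boxplot can be computed by hand. A minimal sketch on invented values for a single genre, using the 1.5 × IQR rule and ending the whiskers at the most extreme observations inside those limits, as Tukey-style boxplots are commonly drawn. The quartiles use pandas' default interpolation, which may differ slightly from the method used by the plotting library.

```python
import pandas as pd

# Invented gross values for a single genre; the last one is an extreme value.
gross = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 300.0])

q1, median, q3 = gross.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Limits of the 1.5 x IQR rule: points outside them count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the limits.
inside = gross[(gross >= lower_fence) & (gross <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()
outliers = gross[(gross < lower_fence) | (gross > upper_fence)]

print(q1, median, q3, lower_whisker, upper_whisker, outliers.tolist())
```

Here the single value of 300 falls above the upper limit, so it would appear as an isolated point while the upper whisker stops at 80.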

2.2.5 Side-by-side: Boxplot and Bar Chart

To contrast aggregated values (bar chart) with the full distribution (boxplot), we can display them together:

Code
bar = alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('mean(Worldwide_Gross)'),
    alt.Y("Major_Genre")
)

box = alt.Chart(movies_cleaned).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),  # raw values: the boxplot computes its own statistics
    alt.Y("Major_Genre")
)

box | bar

This comparison reveals whether the mean is a good representative of each genre, or whether the data is skewed or contains outliers that pull the average away from the typical movie.
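
The same check can be done in a table by putting the mean next to the median for each genre. A minimal sketch on an invented frame (toy); the mean_minus_median column name is hypothetical.

```python
import pandas as pd

# Toy stand-in for `movies_cleaned`; values are invented.
toy = pd.DataFrame({
    "Major_Genre": ["Drama", "Drama", "Drama", "Action", "Action", "Action"],
    "Worldwide_Gross": [1.0, 2.0, 3.0, 2.0, 4.0, 300.0],
})

# Mean and median side by side; a large gap hints at skew or outliers.
by_genre = toy.groupby("Major_Genre")["Worldwide_Gross"].agg(["mean", "median"])
by_genre["mean_minus_median"] = by_genre["mean"] - by_genre["median"]
print(by_genre)
```

In this toy data the Action mean is dominated by one large value, while the Drama mean and median coincide.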

2.3 Bivariate Analysis: Quantitative vs Quantitative

When analyzing two quantitative (numerical) variables simultaneously, we aim to discover possible relationships, trends, or correlations. This type of bivariate analysis can reveal whether increases in one variable are associated with increases or decreases in another (positive or negative correlation), or if there’s no relationship at all. The most common and intuitive visualization for this is the scatterplot.

2.3.1 Scatterplots

Scatter plots are effective visualizations for exploring two-dimensional distributions, allowing us to identify patterns, trends, clusters, or outliers.

Let’s start by visualizing how movies are rated across two popular online platforms:

Are movies rated similarly on different platforms?

Code
alt.Chart(movies_cleaned).mark_point().encode(
    alt.X('IMDB_Rating'),
    alt.Y('Rotten_Tomatoes_Rating')
).properties(
    title="IMDB vs Rotten Tomatoes Ratings"
)
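
A scatterplot is often paired with a correlation coefficient. The sketch below, on invented ratings, computes Pearson's coefficient (linear association) and Spearman's (monotonic association, obtained here as Pearson's coefficient on ranks). Rows with a missing rating on either platform are dropped first.

```python
import pandas as pd

# Toy stand-in for the two rating columns; values are invented.
toy = pd.DataFrame({
    "IMDB_Rating": [3.4, 5.7, 6.1, 6.7, 7.0, None],
    "Rotten_Tomatoes_Rating": [3.0, 26.0, None, 82.0, 65.0, 89.0],
})

# Keep only movies rated on both platforms.
pairs = toy.dropna()

# Pearson: strength of the linear relationship.
pearson = pairs["IMDB_Rating"].corr(pairs["Rotten_Tomatoes_Rating"])

# Spearman: Pearson applied to the ranks, so only the ordering matters.
spearman = pairs["IMDB_Rating"].rank().corr(pairs["Rotten_Tomatoes_Rating"].rank())

print(len(pairs), round(pearson, 2), round(spearman, 2))
```

Because the two platforms use different scales (0 to 10 versus 0 to 100), a rank-based measure is a reasonable companion to Pearson's coefficient.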

2.3.2 Scatterplot Saturation

Scatterplots can become saturated when too many points overlap in a small area of the chart, making it difficult to distinguish dense regions from sparse ones. For example, when plotting financial variables like production budget versus worldwide gross:

Code
saturated = alt.Chart(movies_cleaned).mark_point().encode(
    alt.X('Production_Budget'),
    alt.Y('Worldwide_Gross')
).properties(
    title="Saturated Scatterplot: Budget vs Gross"
)
saturated

2.3.3 Using Binned Heatmap to Reduce Saturation

To address saturation, we can bin both variables and use a heatmap where the color intensity represents the number of movies that fall into each rectangular region of the grid. This makes dense areas easier to interpret.

Code
heatmap_scatter = alt.Chart(movies_cleaned).mark_rect().encode(
    alt.X('Production_Budget', bin=alt.Bin(maxbins=60)),
    alt.Y('Worldwide_Gross', bin=alt.Bin(maxbins=60)),
    alt.Color('count()'),
    alt.Tooltip('count()')
).properties(
    title="Binned Heatmap: Budget vs Gross"
)
heatmap_scatter
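
The counts behind a binned heatmap are a two-dimensional histogram. A minimal NumPy sketch on invented budgets and grosses (in millions), using a coarse 2 × 2 grid so the result is easy to read:

```python
import numpy as np

# Invented budgets and grosses, in millions of dollars.
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 60.0, 80.0, 85.0])
gross = np.array([2.0, 1.0, 5.0, 9.0, 3.0, 140.0, 230.0, 83.0])

# counts[i, j] is the number of movies whose budget falls in budget bin i
# and whose gross falls in gross bin j.
counts, budget_edges, gross_edges = np.histogram2d(
    budget, gross, bins=2, range=[[0, 100], [0, 250]]
)
print(counts)
```

Five of the eight points land in the low-budget, low-gross cell, which is the kind of concentration that overlapping markers hide in a raw scatterplot.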

2.3.4 Side-by-side Comparison

Compare the raw scatterplot with the heatmap representation:

Code
saturated | heatmap_scatter

2.4 Bivariate Analysis: Categorical vs Categorical

When working with two categorical variables, bivariate analysis helps us understand how categories from one variable relate or are distributed across the other. For example, we might want to know how different movie genres are rated according to the MPAA rating system. Visualization techniques like grouped bar charts and faceted plots can reveal patterns, associations, or class imbalances.

2.4.1 Basic Faceted Bar Chart

We begin by exploring how movies are rated (MPAA_Rating) across different genres (Major_Genre). A faceted bar chart allows us to visualize this relationship by plotting a bar chart per genre, helping to identify genre-specific rating distributions.

Code
alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('count()'),
    alt.Y('MPAA_Rating'),
    alt.Color('MPAA_Rating')
).facet(
    'Major_Genre'
)

2.4.2 Vertical Faceting for Alignment

Faceting horizontally can make comparisons across genres harder when the x-axis is misaligned. By specifying columns=1, we lay out the facets vertically, making it easier to compare counts across genres.

Code
alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('count()'),
    alt.Y('MPAA_Rating'),
    alt.Color('MPAA_Rating')
).facet(
    'Major_Genre',
    columns=1
)

2.4.3 Dependent vs Independent Axis Scaling

By default, facet plots share the same x-axis scale (dependent scale), which allows for easier comparison across panels. However, when the number of observations varies greatly between genres, this shared scale can compress some charts.

We can instead use independent x-axis scaling for each facet. This highlights the relative distribution within each genre.

Code
shared_scale = alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('count()'),
    alt.Y('MPAA_Rating'),
    alt.Color('MPAA_Rating')
).facet(
    'Major_Genre',
    columns=4
)

independent_scale = alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('count()'),
    alt.Y('MPAA_Rating'),
    alt.Color('MPAA_Rating')
).facet(
    'Major_Genre',
    columns=4
).resolve_scale(x='independent')

shared_scale | independent_scale

The left panel (shared scale) supports absolute comparisons between genres, while the right panel (independent scale) makes within-genre comparisons more readable.

2.4.4 Heatmaps

Heatmaps are effective for visualizing the relationship between two categorical variables when the goal is to display counts or frequency of occurrences. They map the number of observations to color, providing an intuitive view of which category pairs are most or least common.

We can enhance this basic representation by also using marker size, combining both color intensity and circle area to represent counts more effectively. This dual encoding can improve interpretation, especially when printed in grayscale or when there are subtle color differences.

Code
heatmap_color = alt.Chart(movies_cleaned).mark_rect().encode(
    alt.X('MPAA_Rating'),
    alt.Y('Major_Genre', sort='color'),
    alt.Color('count()')
).properties(
    title="Heatmap with Color (Count of Movies)"
)

heatmap_size = alt.Chart(movies_cleaned).mark_circle().encode(
    alt.X('MPAA_Rating'),
    alt.Y('Major_Genre', sort='color'),
    alt.Color('count()'),
    alt.Size('count()')
).properties(
    title="Heatmap with Color + Size (Count of Movies)"
)

heatmap_color | heatmap_size
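
The numbers these heatmaps encode form a contingency table. A minimal sketch with pd.crosstab on an invented frame (toy): the first table holds absolute counts, the second holds row-normalized shares, which is the tabular analogue of reading each genre on its own scale.

```python
import pandas as pd

# Toy stand-in for the two categorical columns; values are invented.
toy = pd.DataFrame({
    "Major_Genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Adventure"],
    "MPAA_Rating": ["R", "R", "PG-13", "R", "PG", "PG"],
})

# Absolute counts per (genre, rating) pair.
counts = pd.crosstab(toy["Major_Genre"], toy["MPAA_Rating"])

# Shares within each genre: every row sums to 1.
shares = pd.crosstab(toy["Major_Genre"], toy["MPAA_Rating"], normalize="index")

print(counts)
print(shares.round(2))
```

Counts answer "which pairs are most common overall", while shares answer "how is each genre split across ratings" regardless of how many movies the genre has.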

2.5 Multivariate Analysis

Multivariate analysis helps us understand the interactions and relationships among multiple variables simultaneously. In the context of numerical features, it is useful to explore pairwise distributions, correlations, and detect potential clusters or anomalies.

When the number of variables is large, repeated charts such as histograms or scatter plot matrices help us summarize patterns efficiently and consistently across all numerical dimensions.

2.5.1 Repeated Histograms for Numerical Columns

We first identify and isolate all numerical columns from the dataset. Then we repeat a histogram for each of these columns to understand the individual distributions. This overview is helpful to detect skewness, outliers, or binning decisions that affect how data is grouped visually.

Code
# Select only numerical columns
numerical_columns = movies_cleaned.select_dtypes('number').columns.tolist()
Code
alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X(alt.repeat(),type='quantitative',bin=alt.Bin(maxbins=30)),
    alt.Y('count()')
).properties(
    width=150,
    height=150
).repeat(
    numerical_columns,
    columns=4
)

2.5.2 Scatter Plot Matrix (Pairplot)

A scatter plot matrix shows the pairwise relationships between all numerical variables. This is a common exploratory tool to detect:

  • Correlations between variables
  • Outliers or clusters
  • Relationships useful for prediction models (e.g., to predict rating or budget)

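The matrix is symmetric: each pair of variables appears twice, once on each side of the diagonal with the axes swapped. It is therefore enough to read the plots below the diagonal.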

Code
alt.Chart(movies_cleaned).mark_point().encode(
    alt.X(alt.repeat('column'),type='quantitative'),
    alt.Y(alt.repeat('row'),type='quantitative'),
    alt.Tooltip('Title:N')
).properties(
    width=100,
    height=100
).repeat(
    column=numerical_columns,
    row=numerical_columns
)
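
A correlation matrix is a compact numerical companion to the scatter plot matrix. A minimal sketch on an invented frame (toy), selecting numerical columns the same way as above:

```python
import pandas as pd

# Toy stand-in for `movies_cleaned`; values are invented.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Production_Budget": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Worldwide_Gross": [2.0, 4.1, 5.9, 8.2, 9.8],
    "IMDB_Rating": [7.0, None, 6.0, 6.5, 5.0],
})

numerical_columns = toy.select_dtypes("number").columns.tolist()

# Pairwise Pearson correlations; missing values are excluded pair by pair.
corr_matrix = toy[numerical_columns].corr()
print(corr_matrix.round(2))
```

Like the scatter plot matrix, the result is symmetric with ones on the diagonal, so only one triangle needs to be read.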

2.5.3 Heatmap Matrix

When scatter plots become too saturated (many overlapping points), heatmaps offer a better alternative by binning the numeric values and encoding the count in color intensity.

Code
alt.Chart(movies_cleaned).mark_rect().encode(
    alt.X(alt.repeat('column'),type='quantitative',bin=alt.Bin(maxbins=30)),
    alt.Y(alt.repeat('row'),type='quantitative',bin=alt.Bin(maxbins=30)),
    alt.Color('count()'),
    alt.Tooltip('count()')
).properties(
    width=100,
    height=100
).repeat(
    column=numerical_columns,
    row=numerical_columns
).resolve_scale(
    color='independent'
)

To gain deeper insights into the dataset, it’s important to analyze how numerical variables behave across different categories. This type of multivariate analysis allows us to:

  • Compare distributions across categories
  • Detect outliers within categories
  • Observe central tendency (median, quartiles) and spread (range, IQR)

Boxplots are particularly effective for this purpose. In the following visualizations, we explore these relationships by repeating plots across combinations of categorical and numerical features.

2.5.4 Filter Categorical Columns

First, we select the categorical columns and exclude identifiers and text-heavy variables (titles, release dates, distributors, and director names), which typically have too many distinct values to be useful as grouping variables.

Code
categorical_columns = movies_cleaned.select_dtypes('object').columns.to_list()

categorical_columns_remove = ['Title','Release_Date','Distributor','Director']

categorical_filtered = [col for col in categorical_columns if col not in categorical_columns_remove]
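
The exclusion list above is written by hand. One way to inform that choice is to count distinct values per column. The sketch below does this on an invented frame (toy); the max_levels cutoff is a hypothetical parameter, not something used in this report, and non-numeric columns are selected with exclude="number" as an assumption that works for this toy data.

```python
import pandas as pd

# Toy stand-in for `movies_cleaned`; values are invented.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Director": ["X", "Y", "Z", None],
    "Major_Genre": ["Drama", "Comedy", "Drama", "Drama"],
    "MPAA_Rating": ["R", "R", "PG", None],
    "US_Gross": [1.0, 2.0, 3.0, 4.0],
})

# Non-numeric columns are the candidates for categorical analysis.
categorical_columns = toy.select_dtypes(exclude="number").columns.to_list()

# Number of distinct non-missing values per column.
cardinality = toy[categorical_columns].nunique().sort_values(ascending=False)
print(cardinality)

# Hypothetical cutoff: keep columns with at most `max_levels` distinct values.
max_levels = 2
categorical_filtered = [c for c in categorical_columns if cardinality[c] <= max_levels]
print(categorical_filtered)
```

Columns where nearly every row has its own value, such as titles, behave like identifiers and produce unreadable axes when used as categories.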

2.5.5 Repeated Boxplots: Categorical vs Numerical

We repeat boxplots using combinations of categorical (rows) and numerical (columns) features. This matrix layout gives a clear visual overview of how numerical values are distributed within each category.

Code
alt.Chart(movies_cleaned).mark_boxplot().encode(
    alt.X(alt.repeat('column'),type='quantitative'),
    alt.Y(alt.repeat('row'),type='nominal'),
    alt.Size('count()')
).properties(
    width=200,
    height=200
).repeat(
    column=numerical_columns,
    row=categorical_filtered
)

2.5.6 Faceted Boxplots

For a more focused analysis, we can facet the boxplots by a specific categorical variable such as MPAA_Rating and repeat the chart over the variables in categorical_filtered, one per row. This keeps the numerical axis (Running_Time_min) fixed while comparing how each categorical variable behaves across classes (here, MPAA ratings).

Code
alt.Chart(movies_cleaned).mark_boxplot().encode(
    alt.X('Running_Time_min', type='quantitative'),
    alt.Y(alt.repeat('row'),type='nominal'),
    alt.Size('count()'),
    alt.Tooltip('Title:N')
).properties(
    width=100,
    height=100
).facet(
    column='MPAA_Rating'
).repeat(
    row=categorical_filtered
)